When a human being starts learning a language, they start with the basic building blocks (letters and numbers) and gradually learn how to construct words, phrases, sentences and more. Over time we learn that the word “bank” has different meanings in “bank of the river” and “money in the bank”. But how can we make a computer learn the difference?
Until the last decade, computer scientists relied on complex linguistic rules, grammars, and probabilistic models to solve problems like sentiment analysis and opinion mining. Since 2010, there has been a monumental shift in the field of natural language processing (NLP). In 2013, Google introduced Word2Vec, a model that represents words as vectors in which semantically similar words lie close together.
Figure 1: Word2Vec (Image ref - https://www.smartcat.io/blog/2017/word2vec-the-world-of-word-vectors/)
Word2Vec can be trained with two different architectures:
Continuous Bag of Words (CBOW)
In this architecture, the surrounding context words are fed into a neural network to predict the target (middle) word.
Skip-gram
In skip-gram, an individual word is used to predict its contextual words, i.e. the previous and next words.
Take the example of the sample text “a quick brown fox jumps over the wall”. In the CBOW architecture, we use “quick”, “brown”, “jumps” and “over” to predict the intermediate word “fox”. In skip-gram, however, we use “quick” to predict “brown” and “fox”, “brown” to predict “quick” and “fox”, and so on. During the learning process, the model learns the relationship between words and their contexts, which leads to better word representations, as sketched below.
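To make the two training objectives concrete, here is a minimal sketch (an illustration, not code from the original Word2Vec implementation) that generates the training pairs each architecture would see for this sentence, assuming a context window of two words:

# Illustrative sketch: training pairs generated from one sentence,
# assuming a context window of 2 words on each side.
sentence = "a quick brown fox jumps over the wall".split()
window = 2

cbow_pairs = []       # CBOW: (context words, target word)
skipgram_pairs = []   # skip-gram: (center word, one context word)

for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))
    skipgram_pairs.extend((target, c) for c in context)

print(cbow_pairs[3])       # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(skipgram_pairs[:4])  # [('a', 'quick'), ('a', 'brown'), ('quick', 'a'), ('quick', 'brown')]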
Following Word2Vec, many models leveraged word prediction for different contextual learning tasks. Doc2Vec takes a whole sentence or document as input and generates a representation at that level. Lda2Vec combines word representations with the semantic representations learned from topics. Stanford’s GloVe combines word embeddings with global co-occurrence statistics and is one of the most popular alternatives to Google’s Word2Vec. Facebook’s FastText uses character n-grams and an efficient learning process, which lets it learn embeddings for out-of-vocabulary words as well.
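As a quick illustration of the out-of-vocabulary point, the sketch below trains Gensim’s FastText implementation on a tiny made-up corpus; the corpus, the parameter values and the unseen word are assumptions for this example, and Gensim 4.x argument names are used.

from gensim.models import FastText

# Tiny toy corpus, assumed only for illustration.
sentences = [["nice", "acoustic", "guitar"],
             ["cheap", "electric", "guitar"],
             ["broken", "guitar", "string"]]

# FastText composes a word vector from its character n-grams, so it can
# build an embedding even for a word that never appeared during training.
ft = FastText(sentences, vector_size=10, min_count=1, epochs=10)

print(ft.wv["guitar"][:5])    # vector of a word seen in training
print(ft.wv["guitars"][:5])   # out-of-vocabulary word, built from its n-grams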
Before the inception of word embeddings, most NLP systems used the bag-of-words (BoW) representation for semantic analysis. In the BoW model, a text is represented as a vector of word counts (see the short sketch after the list below).
Although this representation can produce statistically significant results, it has several drawbacks that word embeddings resolve:
The BoW representation suffers from sparsity problems on short texts.
The ordering of words within a text is not considered in the BoW model, which makes it difficult to capture context. It is also very difficult to capture the semantic difference between two texts with this approach.
The BoW model assumes a fixed vocabulary, so the representation fails when a new word appears.
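For reference, here is what the bag-of-words count representation looks like in practice. The sketch uses scikit-learn’s CountVectorizer (a choice made only for illustration; the original article does not use it) and assumes a recent scikit-learn version.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["money in the bank", "bank of the river"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['bank' 'in' 'money' 'of' 'river' 'the']
print(counts.toarray())
# [[1 1 1 0 0 1]
#  [1 0 0 1 1 1]]
# Both senses of "bank" land in the same column: word order and context
# are lost, which is exactly the limitation described above.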
Another advantage of the Word2Vec model is robust pre-training. The widely distributed pre-trained Word2Vec vectors were trained on the huge Google News corpus (roughly 100 billion words). Because of this, many neural-network-based NLP models use Word2Vec representations as default embedding weights to save training time and cost. Further, Word2Vec preserves relationships between semantically similar words, for example:
vector('Madrid') - vector('Spain') ≈ vector('Rome') - vector('Italy')
This makes it an efficient model for various NLP tasks like information retrieval, query resolution, text similarity calculation, etc.
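The analogy above can be checked directly with the pre-trained Google News vectors that ship through Gensim’s downloader; loading them is an assumption of this sketch and requires a sizeable one-off download.

import gensim.downloader as api

# Load the pre-trained Google News Word2Vec vectors (large one-off download).
wv = api.load("word2vec-google-news-300")

# vector('Madrid') - vector('Spain') + vector('Italy') should land near 'Rome'.
print(wv.most_similar(positive=["Madrid", "Italy"], negative=["Spain"], topn=3))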
Word2Vec also has a few drawbacks that were addressed by subsequent work.
Training the naïve Word2Vec model was a critical challenge because the softmax output layer runs over the entire vocabulary. This was resolved by techniques like negative sampling and hierarchical softmax, shown briefly below.
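In Gensim, both tricks are exposed as constructor arguments. The sketch below is illustrative only: the toy token lists and the parameter values are assumptions, not tuned settings.

from gensim.models import Word2Vec

# Toy tokenized corpus, assumed for illustration.
tokenized_reviews = [["nice", "windscreen", "protects", "mic"],
                     ["gooseneck", "holds", "the", "screen"]]

# Negative sampling: update only a handful of sampled "negative" words per example.
model_ns = Word2Vec(tokenized_reviews, min_count=1, negative=10, hs=0)

# Hierarchical softmax: replace the full softmax with a binary Huffman tree over the vocabulary.
model_hs = Word2Vec(tokenized_reviews, min_count=1, negative=0, hs=1)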
As mentioned above, Word2Vec learns a static embedding for each word, i.e. the same word in two different contexts (like “bank” in “bank of the river” and “money in the bank”) will not have different embeddings. Embeddings learned from language models such as ELMo, ULMFiT and BERT overcome this limitation.
Modern deep language models exploit the sequential, contextual structure of text to learn natural language representations. These methods can distinguish the same word used in different contexts and learn its embeddings more accurately.
Nevertheless, Word2Vec was one of the most prominent inventions of the past decade; it opened the door to many subsequent state-of-the-art models that have led to a monumental shift in present-day NLP.
Below, we demonstrate the Word2Vec model using Python’s Gensim package.
We will use the Amazon musical instrument reviews dataset (http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz) to construct word embeddings. A sample review:
“nice windscreen protects my mxl mic and prevents pops. Only thing is that the gooseneck is only marginally able to hold the screen in position and requires careful positioning of the clamp to avoid sagging.”
import pandas as pd
import gzip

def parse(path):
    # The gzipped file stores one review per line as a Python-style dict.
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    # Collect every review dict into a DataFrame, indexed by position.
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Musical_Instruments_5.json.gz')
We do some preprocessing on the review dataset before using Word2Vec.
import re

import nltk
from nltk.stem.porter import PorterStemmer
from gensim.models.word2vec import Word2Vec

# nltk.download('stopwords')  # uncomment if the stopword corpus is not yet installed
stopword_list = nltk.corpus.stopwords.words('english')
st = PorterStemmer()

# Lowercase, keep only alphabetic characters, collapse repeated spaces,
# then drop stopwords and stem the remaining tokens.
df.reviewText = df.reviewText.str.lower()
df.reviewText = df.reviewText.apply(lambda x: re.sub(' +', ' ', re.sub(r'[^a-z]', ' ', x)))
df.reviewText = df.reviewText.apply(lambda x: " ".join([st.stem(i) for i in x.split() if i not in stopword_list]))

# Word2Vec expects a list of tokenized sentences: one list of tokens per review.
allreviews = list(df.reviewText)
allreviews = [i.split() for i in allreviews]

model = Word2Vec(min_count=5)   # ignore words that appear fewer than 5 times
model.build_vocab(allreviews)
model.train(allreviews, total_examples=model.corpus_count, epochs=5)

# Embedding vector for the word "bass"
print(model.wv['bass'])

# 5 most similar words to "bass", with their similarity scores
print(model.wv.most_similar('bass', topn=5))
The five most similar words are “pickup”, “classical”, “acoustic”, “electric” and “humbucker” – all related to guitars.
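Once trained, the model can also be queried for pairwise similarity and its vectors saved for reuse. The word pair and file name below are illustrative; whether a token is in the vocabulary depends on the preprocessing (stemming, stopword removal) and the min_count threshold.

# Cosine similarity between two in-vocabulary tokens; the exact value
# depends on the training run and the preprocessing above.
print(model.wv.similarity('guitar', 'bass'))

# Save only the word vectors in the standard word2vec text format,
# so they can be reloaded later without the full training state.
model.wv.save_word2vec_format('musical_instruments_w2v.txt', binary=False)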